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Abstract 

We consider a Bayesian hierarchical version of the normal theory general linear model which is 
practically relevant in the sense that it is general enough to have many applications and it is not 
straightforward to sample directly from the corresponding posterior distribution. Thus we study a 
block Gibbs sampler that has the posterior as its invariant distribution. In particular, we establish 
that the Gibbs sampler converges at a geometric rate. This allows us to establish conditions for a central 
limit theorem for the ergodic averages used to estimate features of the posterior. Geometric ergodicity 
is also a key component for using batch means methods to consistently estimate the variance of the 
asymptotic normal distribution. Together, our results give practitioners the tools to be as confident in 
inferences based on the observations from the Gibbs sampler as they would be with inferences based 
on random samples from the posterior. Our theoretical results are illustrated with an application to 
data on the cost of health plans issued by health maintenance organizations. 
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1 Introduction 

The flexib ility of Bayesian hierarchical mode l s makes them widely applicable. One of the most po pular 



(see, e.g., iGelman. Carlin. Stern and Rubinl . 12004 ISpiegelhalter. Thomas. Best and Lunn 



20051 ) is a 



version of the usual normal theory general linear model. Let Y denote an JV x 1 response vector and 
suppose (3 is a p x 1 vector of regression coefficients, u is a k x 1 vector, X is a known N x p design 
matrix having full column rank, and Z is a known N x k matrix. Then for r, s,t G {1, 2, • • • }, the 
hierarchy is 

Y\p, u, X R , X d ~ Njy + Zu, A^) 

r 

/8|u, X R , X d ~J2 
1=1 

u\X R , X d ~ N fc (0, A^Ifc) ^ 

Ai? ~ c^-Gamma (ryi, r^) 

j'=i 
t 

Ad — ^ ipi Gamma (dg , ) 
;=i 

where the mixture parameters 7^, </>j, and "0; are known nonnegative constants which satisfy 

r s t 

2=1 3=1 ( = 1 

and we say W ~ Gamma(a, 6) if it has density proportional to w a ~ 1 e~ bw for w > 0. Further, we 
require /3 and u to be a posteriori conditionally independent given A^, Ap, and y which holds if and 
only if X T Z = 0. Finally, bi € R and positive definite matrix B arc known and the hyperparameters 
r\ji, 7j'2, dji, and cfe are all assumed to be positive. 

Let £ = (u T ,/3 T ) T and A = (A^Ad) 7 - Then the posterior has support X = R k+P x and a 
density characterized by 

ir(£,\\y)ocf(y\£,\)f(£\\)f(\) 

where ?/ is the observed data and / denotes a generic density. Posterior inference is often based on 
the expectation of a function g : X — s- R with respect to the posterior. For the model (fTJ) we can only 
rarely calculate the expectation 

E. 5 (£,A) := f <?(£, A)7r(£, X\y)d£dX , 
J x 

since it is a ratio of two potentially high-dimensional intractable integrals. Hence inference regarding 
the posterior may require Markov chain Monte Carlo (MCMC) methods. We consider two-component 
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Gibbs sampling which produces a Harris ergodic Markov chain $ = {(£o> Ao), (£1, Ai), • • ■ } with invari- 
ant density 7r(£, X\y). 

Suppose -Efflffl < oo and we obtain n observations from the Gibbs sampler. Then a natural estimate 
of E^g is g n = n^ 1 J^ILo g(£i, Ai) since g„ — > E^g with probability 1 as rt — > oo. In other words, the 
longer we run the Gibbs sampler, the better our estimate is likely to be. However, this gives no 
indication of how large n must be to ensure the Monte Carlo error g n — E^g is sufficiently small. The 
size of this error is usually judged by appealing to its approximate sampling distribution via a Markov 
chain central limit theorem (CLT), which in the cases of current interest takes the form 



Vn(g n — E^g) -> N(0, a 2 ) as n — > oo 



(2) 



where a 2 £ (0, oo). Due to the serial correlation in $, the variance a 2 will be complicated and require 
specialized techniques (such as batch means or spectral methods) to estimate consistently with <r„, say. 
Suppose cr„ — > a 2 with probability 1 as n — > oo. In this case, an asymptotically valid Monte Carlo 
standard error (MCSE) is given by & n /^/n. In turn, this can be used to perform statistical analysis of 
the Monte Carlo error and to implement rigorous sequential stopping rules for deter mining the length 
of simulation required (see 



Flegal. Haran and Jones 



2008; 



Jones and Hobert 



2001) so that the user 



will have as much confidence in the simulation results as if the observations were a random sample 
from the posterior; this is described in more detail in Section 01 

Unfortunately, for Harris ergodic Markov chains simple moment conditions are not sufficient to 
ensure an asymptotic distribution for the Monte Carlo error or that we can consistently estimate a 2 . 
In addition, we need to know that the convergence of $ occurs rapidly. Thus, one of our goals is 
to establish verifiable conditions under which the Gibbs sampler is geometrically ergodic, that is, it 
converges to the posterior in total variation norm at a geometric rate. 

We know of three papers that address geometric ergo dicity of Gibbs samplers i n the context of the 
norma l theo ry linear model with proper pr i ors. T hese are iHobert and Geverl (|1998[ ) , iJones and Hobert 



(|2004l ), and iPapaspiliopoulos and Roberts ( 2008 ) . The linear model we consider substantively differs 



from those in 



Papaspiliopoulos and Roberts 



2008) in that we do not assume the variance compo- 



nents are known. Our m odel is also much more gener al than the one-way random effects model in 

)_. Gibbs sampling for the balanced one-way 



Hobert and Geverl (| 19981 ) and 



Jones and Hobert (200 



random effects model is also considered in iRosenthall (|1995[ ) where coupling techniques were used to 
establish upper bounds on the total variation distance to stationarity. However, these results fall short 
of establishing geometric ergodicity of the associated Markov chain. 

The rest of this paper is organized as follows. Gibbs sampling for the Bayesian hierarchical general 
linear model is discussed in Section[2]and geometric ergodicity for these Gibbs samplers is established in 
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Section [3J Conditions for the CLT © are given in Section @] along with a description of the method of 
batch means for estimating the variance of the asymptotic normal distribution. Finally, our results are 
illustrated with a numerical example in Section [3J Many technical details are deferred to the appendix. 



2 The Gibbs Samplers 

The full conditional densities required for implementation of the two-component block Gibbs sampler 
are as follows: Conditional on £ and y, A follows the distribution corresponding to density 

s t 

f(\\£,y) =Y,Y, ( t , 3^h3(^y)hi(*D\(;,y) (3) 

j=l 1=1 

where y) denotes a Gamma(rji+iV/2, rj2+fi(£)/2) density and /2/(-|£, y) denotes a Gamma(dji + 

k/2, di2 + «2(£)/2) density with 

:= (y-Xp-Zufiy-Xp-Zu), v 2 (0 := u T u . (4) 



Alsc 



where 



111; 



i=l 

(x R z T z + \ D i k y 1 o 

o ( K \ R x T x + By 1 

\ R {X R Z T Z + X D I k y 1 Z T y 

(X R X T X + By 1 (MX T y + Bb l 



(5) 



These follow from our assumption that X T Z = 0. 

There are two possible update orders for our 2-componcnt Gibbs sampler. First, let $i denote the 
Markov chain produced by the Gibbs sampler which updates £ followed by A in each iteration so that 
a one-step transition looks like (£', A') — > (£, A') — > (£, A). Then the one-step Markov transition density 
(Mtd) for $i is 

fci(£,A|£',A') = /(£|A',y)/(A|£,y). 

Similarly, let $2 denote the Markov chain produced by the Gibbs sampler which updates A followed by 
£ in each iteration so that the one-step transition is (£', A') — > (£', A) — > (£, A). Then the corresponding 
Mtd is 

fc 2 (£,A|£',A') = /(A|£',y)/(£|A,y). 
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Also, let = {£o: £ii • • • } an d $a = {^o, Xx, • ■ ■ } denote the associated marginal chains with Mtds 

km')= [ f(x\e,y)m\x,y)dx 

Jm. 2 + 

and 

**(A|A')= I f(Z\X',y)f(X\Z,y)dl;, 

respectively. 

Because the Mtd's are strictly positive on the state space it is straightforward to show that $1 
and $2 arc Harris crgodic. The posterior density 7r(£, X\y) is invariant for $1 and $2 by construction. 
Moreover, it is easy to see that both chains are Feller. Similarly, <£>£ and $a are Harris ergodic and 
Feller with invariant densities the marginal posteriors 7r(£|y) and n(X\y), respectively. Hence all four 
Markov chains converge in total variation norm to their respective invariant distributions. In the next 
section we establish conditions under which this convergence occurs at a geometric rate. 



3 Geometric Ergodicity 

3.1 Establishing Geometric Ergodicity 

Our main goal in this section is to establish conditions for the geom etric ergodicity of ^1 and B efore 



doing so it is useful to acquaint ourselves a concept introduced bv lRoberts and Rosenthal! (|200l|). Let 
X = {X n , n > 0} be a Markov chain on a space X and Y = {Y n , n > 0} a stochastic process on 
a possibly different space y. Then Y is de-initializing for X if, for each n > 1, conditiona lly on Y n 



Roberts and Rosenthal! ( 2001 ) use this 



it follows that X n is independent of Xq. Roughly speaking, 

concept to show that Y controls the convergence properties of the Markov chain X 

To establish the geometric ergodicity of $1 and $2 it suffices to work with the marginal chains <j>f and 



Roberts and Rosenthal 



< j>A- F irst, $5 is de-initializing for $1 and $a is de-initializing for $2- Results in 
(|200l! ) imply that if (^a) is geometrically ergodic, so is $1 ($2)- Further, $1 and $2 are co-de- 
initializing. Hence if one is geometrically ergodic, then they both are and Lemma 13-11 follows directly. 



Lemma 3.1. If or $a is geometrically ergodic, then so are $1 and $2- 

Accordingly, we can proceed by studying the convergence behavior of the marginal chains. We 
establish geometric ergodicity for <£>£ by establishing a drift condition. That is we need to specify a 
function V : M. k+P — > R+ and constants < 7 < 1 and L < 00 such that 



E[V(£) I £} < jV(£) + L for all f e M. k+P 



(6) 
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where the expectation is taken with respect to the Mtd fcg. Let W(£) = 1 + V(£), b = L + 1 — 7 and 
C = U : < 46/(1 - 7)}. Uones and Hobertl (|2004l Lemma 3.1) show that equation © implies 

A^(0 := E[W(0 I £'] - ^(O < -^WiO + 2bl(e g C) . 

Here AW(£') is the dri/f, V (or W) is a dri/t function and 7 a drt/S rate. If £' ^ C the expected change 
in W is negative so will tend to "drift" to C, that is, where the value of W is small. Moreover, it 
also does it in such a way that the drift towards C is faster when 7 is small. On the other hand, if 
7a 1 the drift will be slow. Thus the value of 7 is intimatel y connected to the conve rgence rate of 

Section 3.3). 



Jones and Hobert (2001 



for a thorough accessible discussion of the connection see 
Hence examination of 7 can give us some intuition for the convergence behavior of $£. However, drift 
functions arc not unique so this examination generally will not lead to definitive conclusions. 

A function V : M. k+P — > K is unbounded off compact sets if the set {£ € : V(£) < d} is compact 

for any d > 0. Note that the maximal irreducibility measure for is equivalent to Lebcsgue on R fc+P 
so that its support certainly has a non-empty interior. The sufficiency of drift for geometric ergod icity 
now follows easily from Lemma 15.2.8 and Theorems 6.0.1 and 15.0.1 of lMevn and Tweedid (|1993l ) and 
Lemma 13.11 



Proposition 3.1. Suppose © holds for a drift function that is unbounded off compact sets. Then $^ 
is geometrically ergodic and so are $1 and $2- 

In Section ^. 2l we develop conditions on our Bayesian model ([T]) which arc sufficient for the conditions 
of Proposition 13. II 



3.2 Drift for $ 5 

For all j e {1, • ■ • , s} and I G {1, ■ • • , f}, define constants 



J 33 



2r jX + N - 2 



">12 



2d n + k — 2 



and (5;4 = 



N t 



(7) 



4(2rji + N - 2) ' " ~" 1 2dn+k-2 

Also, let Xi and z% denote the ith rows of matrices X and Z, respectively, and let yi and Ui denote the 
tth elements of vectors y and it, respectively. Next, for i £ {1, • • • , r} define 

JV fc 

G»(A) := [Ej (y m - x m f3 - z m u\X,y)] + ^ [Ej (u TO |A, y)] 2 



m— 1 



m — 1 



where E.j denotes expectation with respect to the iVfc + p(mi,E 1 ) distribution. 
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Proposition 3.2. Assume there exists some K < oo such that G,(A) < K for all A G 
i G {1, • • • , r}. £e£ V(£) = Ui(£) + «2(£) where i>i(-) and W2O) are defined at ([!]). 

1. If Z T Z is nonsingular, dn > 1 /or all I E {!,••• ,0; anc ^ 



ana 1 



rj-i > V 0.5 ^ z i (Z T Z)- 1 ^f - AT + 2^ /or a« j e {1, ■ 
then ^ holds for drift function V(£) mitt max^ i{<5ji, #; 2 } _■ 7 < 1 an d 

N 



2. If for all j G {1, 



i = V] a^iB 1 xf + max {2r ]2 Sji + 2di 2 Si 2 } + K . 

l=l 

, s} and I G {1, ■ • ■ , t} 
rj! > V 0.5 



TV 



d n > 0.5 



0.25 ^^(A^X) -1 ^ -iV + 2 

JV 

2 + E 



T 



and 



then ^ holds for drift function V(£) witt max^/j^s, £(4} < 7 < 1 and 



1 W 

L = - x i B ^ + max {2^-2^3 + 2d /2 <5 M } + K 



i=l 



Proof. See Appendix IA.2I 



□ 



Notice that the formulations of 7 given by Proposition 13.21 depend on the Bayesian model setting 
through 5ji, 612, 6ja, and S14. Therefore, the drift and convergence rates of the $j marginal chain 
(hence the Gibbs samplers) may be sensitive to changes in the dimension k of u, the total number 
of observations N, or the hyperparameter setting. However, it is interesting that the dimension of /3, 
which is p, has only an indirect impact on this result. Specifically, when Z T Z is nonsingular the value 
of p has no impact, that is, the drift rate is unaffected by changes in p. Of course, changing p does 
mean that X changes which may impact 5j3 which in turn can change the permissible hyperparameters 
rji and the drift rate when Z T Z is singular. 

Example 3.1. Consider the balanced random intercept model derived from |T]) for k subjects with m 
observations each. In this case, Z = Ik®l m where ® denotes the Kronccker product and l m represents 
a vector of ones of length m. Hence Z T Z = mlk is nonsingular. Define 

k k 



M, 



N,k 



max 
/ 



2r,i + N - 2 ' 2dn + k-2 
If dn > 1 for all /, Condition 1 of Proposition 13.21 establishes drift rate M^,k < 7 < 1- Notice that 
-Mjv.fc — > 1 as fc — > 00 and hence 7 — > 1 as well. This supports our intuition that the Gibbs sampler 
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should converge more slowly as its dimension increases. On the other hand, if k is held constant but 
m increases so that N = km — > oo, then 

k 



< 7 < 1 



2d n +k-2 

Thus increasing the number of observations per subject does not have the same negative, qualitative 
impact as increasing the number of subjects. Finally, Mn,): — > 1 (hence 7 — > 1) when k is held constant 
and dn — > 1 for any I. 

Consider the condition that Gi(X) < K for all A G and i G {1, • • • , r}. The following result 
establishes this condition for an important special case of ([T]). In our experience it is often straightfor- 
ward to show that Gi is bounded and, if desired, numerical optimization methods yields appropriate 
K. 

Proposition 3.3. Assume hi = for all i G {1, • • ■ , r}. 

1. If Z = 0, then 

Gi(X R ) < y T y 

for all A/? G M+ and i G {1, • ■ ■ , r}. 
^. // Z T Z is nonsingular, then 

G i (X)<y T y + y T Z(Z T Z)- 2 Z T y 

for all A G and i G {1, ■ ■ ■ , r}. 

Proof. See Appendix IA.2I □ 

We are now in position to state conditions on ([T]) guaranteeing geometric ergodicity of the Gibbs 
samplers $1 and <&2. This follows easily from Propositions 13.11 and 13.21 if the drift function V(£) = 
MO+MO is unbounded off compact sets on R k+ P. Define S = {£ G R fc+P : V(f) = wi(0 + "2(0 < ^} 
where <i > 0. Notice that F is continuous so it is sufficient to show that, on S, \{3i\ is bounded for 
i G {1,2, ... f p} and is bounded for j G {1,2,..., k}. Clearly, S G S2 = {£, ■ u Tu < d} and it is 
obvious that each \uj\ is bounded on S2 hence also on S. Moreover, note — > 00 as \uA — > 00. Given 
that the \uj\ are bounded it is easy to see that V\ — > 00 as |^| — > 00. Putting this together we see that 
V is unbounded off compact sets. The main result of this section follows. 



Theorem 3.1. Assume the conditions of Proposition HOI Then the Markov chain and the Gibbs 
samplers $1 and $2 are geometrically ergodic. 



8 



4 Interval Estimation 



Suppose we want to estimate an expectation E^g := J x A)7r(£, X\y)d^d\ where g is real- valued and 
7r-integrable. It is straightforward to estimate E^g with g n := n _1 Y^ii=o d(£ii ^»)- A key step in the 
statistical analysis of g n is the assessment of the Monte Carlo error g n — E^g through its approximate 
sampling distribution. 

Theorem 4.1. Assume the conditions of Theorem \S.l\ If E v \g\ 2+e < oo for some e > 0, then there is 
a constant a 2 G (0, oo) such that for any initial distribution 



\fn[g n — E^g) — > N(0, <7„) as n — > oo . 
The pro of of this theorem follow s easily from Theorem l3.11 Th eorem 2 of 



Section 1 of 



Fleeal and Jonea (120091) . Roughly spea king, results in 



(|2002h . bones. Haran. Caffo and Neatb] |2006f ) and 



Chan and Geverl (I1994T) and 



Hobert. Jon 



Bednorz and Latuszvnski 



es. Presnell and Rosenthal 



20071 ) show that, under 



conditions comparable to those required for Theorem 14.11 techniques such as regenerative simulation 



and batch means c an be used to construct an estimator of a 2 say a 2 , such that a„ 



oo 



almost surely. See iFlegal and Jonesl (|2009l ) for the conditions required to ensure consistency of over- 



lapping batch means and spectral estimators of a 2 . 

Before giving a precise discussion of the conditions for consistency we need a preliminary definition 
and result. Let X C R d for d > 1 and k : X x X — > [0, oo) be an Mtd with respect to Lebesgue measure. 
Suppose there exists a function s : X — >• [0, 1) and a density q such that for all x, x' € X 

k{x\x') > s(x')q(x) . 

Then we say there is a minorization condition for k. 
Lemma 4.1. Let C C X be compact and assume c > where 



c= / k(x\x*)dx 
Jc 

and some x* G X . If for each x' , k(-\x') is positive and continuous on C , then there exists a minorization 
condition for k. 



Proof. The proof follows a technique first introduced by 
Then for all x G C 



Mvkland. Tiernev and Yul |l995f l. Fix x* G X. 



k(x\x') = ^flk{x\x') > 
k[x\x*) 



inf 



k{x\x') 



xac k{x\x*) 



k{x\x*) 
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Let x m be the point where the infimum is achieved. Then the minorization follows by setting q(x) 
c~ 1 k(x\x*)I(x £ C) and 



s(x') 



^ k(x m \x ) 
k{x m \x*~) 



□ 



The conditions of Lemma 14.11 are not the weakest that ensure the existence of a minorization 
condition but they will suffice for our purposes. In particular, it is straightforward to use Lemma |4. II to 
see t hat there exists a minorizati o n con dition for both ki and &2 the Mtd's for 4>i and $2, respectively. 
Also, iHobert. Jones and Robert! (|2006|) derived an explicit closed form expression for a minorization 



for a Markov chain for which $2 is a special case 



The consistency results for &1 in 



and 



Flegal and Jonesl (|2009( K 



Hobert et al 



((2002J) 



Jones et al 



pood ) 



Bednorz and Latuszvnskil ({20071 ) all require that a minorization condition hold. The efficacy of 
regenerative simulation is utterly dependent upon the minorization while minorization is irrelevant 
to the implementation of batch means and spectral methods. That is, the minorization is purely a 
technical device used in the proofs of consistency for batch means and spectral estimators. 

We use the method of batch means in Section [5] to estimate a 2 . Let n be the simulation length, 
b n = \ji a \ and a n = [n/b n \. Now define 



Y, 



g(£i,K) for j = 1, . . . ,a Tl 



i=(j-l)b n 



The batch means estimate of a 2 is 



(8) 



Puttin g together our Theorem l3 . 1 l and Lemma POl with results in 
(|2007h we have the following consistency result. 



Jones et al 



(|2006h and 



Bednorz and Latuszvnski 



Theorem 4.2. Assume the conditions of Theorem VS '. 1\ If Ei T \g\ 2+e < 00 for some e > set e = e\ +62 
and let (1 + ei/2)~ 1 < a < 1, then for any initial distribution for either $1 or $2 we have that — » a 2 
with probability 1 asn->oo. 

Using Theorems 14.11 and 14.21 we can use ((8]) to form an asymptotically valid confidence interval for 
E^g in the usual way 



9n ± r a„-l —j= 



(9) 



where i a „— 1 is a quantile from a Student's t distribution with a n — 1 degrees of freedo m. Moreover 



we can use batch means to implement the fixed-width methods of 



Jones et al 



(2006) to determine 
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how long to run the simulation. Following 



Flegal et al 



(|2008h let e be the desired half-width of the 



interval in © and n* be a minimum simulation size specified by the user. Then we can terminate the 
simulation the first time 

t an -i^= + £l{n > n*) + -<£. 
\Jn n 

The final interval estimate will be asymptotically valid in th e sense that t he int er val will have the 
desire d coverage probabi l ity fo r suf ficiently sma l l e; s ee also 



Flegal et al 



()2009UGlvnn and Whittl (119921) and 



Jones et al 



(2008) 



Flegal and Jones 



(200G). 



5 A Numerical Example 

In this section we illustrate our theoretical results in the analysis of US government health maintenance 
organization (HMO) data. To study the cost-effectiveness of transferring military retirees from a 
Defense Department health plan to health plans for government employees, information was gathered 
from 341 state-based health maintenance organizations (HMOs). These plans represent 42 states, 
the District of Columbia, Puerto Rico, and Guam. An HMO plan's cost is measured by its monthly 
premium for individual subscribers. Two possible factors in this cost are (1) the typical hospital 
expenses in the state in which the HMO operates; and (2) the region in which the HMO operates. In 
Figure [TJ the individual monthly premiums for the 341 HMOs are plotted against the average expenses 
per admission in the state of operation (both in US dollars). 

Let yi denote the individual monthly premium of the ith HMO plan. To analyze these data 



Hodges 



(|l998l ) considered a Bayesian version of the following frequentist model: 

Vi = A) + + P2X12 + £1 (10) 

where the £i are iid N (0, A^ 1 ), xu denotes the centered and scaled average expenses per admission in 
the state in which the ith HMO operates, and 2^2 is an indicator for New England. The xu values were 
centered and scaled to avoid collinearity. Specifically, if in is the raw average expense per admission 
and xi is the overall average expense per admission, xu = {in — xi)/1000. 

We perform a Bayesian regression analysis based on the following hierarchical version of (fT0|) : 

j/IAAfl-Njv (XP,\^I N ) 

0\Xr ~ N 3 (b, B^ 1 ) (11) 
Xr ~ Gamma(?'i, j^) 

where N = 341, y is the N x 1 vector of individual premiums, fj = (/?o, /3i, fc) is the vector of regression 
parameters, and X is the N x 3 data matrix. 
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Figure 1: Individual monthly HMO premiums are plotted against the average expenses per admission in 
the state in which the HMO operates. Solid circles represent states in New England. 



Complete specification of the model requires values for hyperparameters (b, B,r!,r 2 ). We use an 
approach which is empirical Bayesian in spirit. To this end, we fit (fTU)) using least squares regression. 
The results are summarized in Table [TJ 

Accordingly we chose the following prior mean and covariance matrix for (3: 



( 



b = 



\ 



164.989 
3.910 
32.799 



\ 



and 



B- 1 = 



( 2 \ 

3 
36 



where b is the vector of least squares estimates and the diagonal elements of B 1 are reflective of the 
corresponding squared standard errors in Table [T] Next, we set the prior mean and variance for \r to 

E(A fl ) = !1 = JL = 0.00177; and 

Var(A fl ) = V -\ = 1 

where MSE is the least squares estimate of A^ 1 given in Table [T] Solving for r\ and r 2 gives r\ = 
3.122 * 10~ 6 and r 2 = 0.00177. 
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Table 1: Least squares regression results for ( I10|) . 



Parameter 


Estimate 


Standard Error 


A) 


164.989 


1.322 


ft 


3.910 


1.508 


& 


32.799 


5.961 



N = 341 

degrees of freedom = 338 

MSE = SSE/338 = Y*Ll(Vi ~ y*)7 338 = 23.79 2 

Since (Ti"Tj) does not contain any random effects, it follows from Theorem 13 . 1 1 that the Gibbs sampler 
for 7r(/3, XrIu) is geometrically ergodic since 

n > V 0.5 [2 - N] = 

and for any A^ € R + the function G(Xr) is bounded (recall Proposition I3.3P where 

N 

g(\ r ) = Y, Hvi - ^l A «> y)f = (y- *e(/3|a r , y )) T ( y - xe(p\\ r , y )) 

i=l 

andE(/3|A Rl </) = (A fl X T X + S)- 1 (A J? X T 2/ + 

Consider estimating the posterior means of fa , /?i , and fa . By Lemma IA.6[ the fourth posterior 
moments of these parameters arc finite. Thus Theorems 14.11 and 14.21 in conjunction with geometric 
crgodicity guarantee the existence of CLTs and consistent estimators of the as ymptotic var i ance via 



batch means with b n = [n°- 501 \ which was chosen based on recommendations in I Jones et al.l (]2006f ). 

To begin our analysis of the posterior means, we ran independent Gibbs samplers from a variety of 
starting values and updated \r followed by (3 in each iteration. For each chain, we required a minimum 
simulation length of 1000. At each successive iteration, we calculated the approximate half-widths of 
the Bonferroni-corrccted 95% intervals for the posterior means of fa, fa, and fa, 

tin 1 
*o„-l, 0.025/3^^ "I • 

sjn n 

Simulation continued until the half-widths for fa, fa, and fa, were below 0.10, 0.02, and 0.10, re- 
spectively. The results were consistent across starting values. That is, Gibbs samplers with different 
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starting values produced similar estimates and required similar simulation effort to meet the above 
specifications. Here, we present the results for the chain started from the prior means of /3 and Xr, 
(b,r±/r2). Under this setting, the interval half-width thresholds were met after 16831 iterations. The 
corresponding estimates of the posterior means are reported in Table [2] with standard errors. 

Table 2: Estimates of posterior means with corresponding standard errors. 



Parameter 


Estimate 


Standard Error 


A) 


165.0 


0.007 


A 


3.9 


0.008 


& 


32.8 


0.032 



A Appendix 

A.l Proof of Proposition 13.21 

We wi ll require the following general results in our proof. A proof of Lcmma fA.ll is given in 



Henderson and Searle 



(|198ll ) and Lemma IA.2I follows from the convexity of the inverse function. 



Lemma A.l. Let A be a nonsingular n x n matrix, B be a nonsingular s x s matrix, U be an n x s 
matrix, and V be an s x n matrix. Then 

{A + UBV)- 1 = A' 1 - A~ 1 U(B~ 1 + VA- l U)- l VA- 1 . 

When U = V this implies 

x T (A + UBVy 1 x < x T A- 1 x 

for any nxl vector x. 

Lemma A. 2. Let x be an m x 1 vector. Also, let A and B be nonsingular, m x m matrices. Then 

x T (A + B)- l x < \x T (A' 1 + B- 1 ) x . 
We begin the proof of Proposition 13.21 Recall that 

vi(0 : = {V-XP- Zu) T {y-XP- Zu), and u 2 (£) := u T u . 
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We must show that for all f g R k+P 

E* t [V(0|C'] - E fe5 M£) + va(OI€'] < 7^(0 + i 

where the constants 7 and L are given in the statement of Proposition 13.21 Let Ej and Var^ denote 
expectation and variance with respect to the Nk+ P (rrij, distribution. Similarly, let Ej; and Var^; 
denote expectation and variance with respect to density 



defined by ([3]). Notice that 



E fe5 ry(Oin = e[e(v(0|a)K'] = ^EE^» PWOW U'] (12) 

j=i ;=i t=i 

where the first equality holds by the construction of $£. Thus we focus on the Eji [Ej(V(£)|A) | £'] in 
the next 3 lemmas. 

Lemma A. 3. Suppose Z T Z is nonsingular. Then for all 

Eji [AM£)|A)U'] + 

N 

+ EZ + 2r j2<5jl- 



where 



L\ — Ej, 



N 



m— 1 



EI [-^(fm - ^m/? - Z m u\X)Y 
_m— 1 

Proof. Consider the inner expectation Ei(ui(£)|A). For any i we have 

N 

Ei(«i(f)|A) = EI E * " ^ " 2 »> U ) 2 I A ] 

m— 1 

TV AT 

= EI I E i(y™ ~ ^n/ 3 ~ z m u|A)] 2 + EI Varj(y m - x m /3 - z m u\\) 



(13) 



and 



Vari(y m - x m (3 - z m u\X) = x m (\ R X T X + B) 



- 1 ™T 



< x m B 1 xJ n + X^z m {Z T Z} zj n 



by Lemma lA.U It follows that for any i, j, / we have 



E j7 [EiKCOIA)^'] <E j7 



A' 



EI [Ei(j/ m - x m /3 |A)] 

m— 1 

AT 

(a^ioEM^ 



*4 + 



EI ^m- 8 1 = 



m= 1 



m=l 
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Combining this with the fact that 
gives 



i,^ _ 2r j2 +v 1 ^') M2r^ +«i(Q) 



E iZ [Ei( Wl (OIA) | e']<E 



N 



^ [Ej(y m - x m /3 - z m u|A)] z 



JV 



+ Sj! (2r j2 + Vl (£')) + ^ x m B \t 

m— 1 

= ^lUl(C') + ^1 ■ 

Lemma A. 4. For any i,j,l 

En [EiivxiOW W\ < &izvxtf) + (5 l4 - 5 12 )V2@) + L 2 

where 



L 2 = E 3l 



N 



2_j l E i(y™ ~ x mP - z m u\x)y 



1 N 

j ^ x m B 1 x I m + 2r j2 5 j3 + 2di 2 (8i4 



Proof. Notice that for any i, j, I 



E jl [Mvi(0\x)\t;'] = v Jl 



^ Ej [(y m - x m /3 - z m uf |A 

m—l 

N 

^ [Ei(j/ m - a: m /3 - z m u|A)] 



m—l 



77).- 



where from Lemmas IA.1I and IA.2I 



Vari(y m - x m f3 - z m u\\) = x m {\ R X T X + B) 1 x T m + z m (\ R Z T Z + \ D I k ) 



< \x m (A^ 1 {X t X) 1 + B- 1 ) x T m + \ D x z m z T m . 



Also, by §S§ we have 



2rji + AT - 2 



2d n + k - 2 



Therefore 



X)m=i E i' [Vari(y m - z m /3 - z m u\\) \ £'] 



16 



N r 



< 



E 

771 — 1 



1_ ^ 2^ + ^(0 
2rji + TV - 2 



2^2+^(0. 

2dn + k - 2 ' 



1 N 



/\\ Z-^dTll — 1 



m—1 
N 



2dn + k - 2 



(16) 



5 i3 (2r ja + + J E ^B'^l + (24a + «a(0) (fo - fa) 



The result holds by combining (|T5|) and (JTo 



□ 



Lemma A. 5. For any I 



where 



Proof. First, for any 



En mv 2 {0\X)\^\ < 5 l2 v 2 (Z') + L 3 



L3 — Eji 



Y^[Ei (u m \X)f 



m—1 



+ 2di 2 5i 2 . 



(17) 



E jt [E i (« a (0|A)|^=E 



•ji 



E 



E E * K\ x ) 

k 



(18) 



^E j7 [Var,K n |A)|a 



Let e m denote the fc x 1 vector with the mth element being 1 and the rest of the elements being 0. 
Thus by Lemma IA.11 



Also, 



Var,(u m |A) = {\ R Z T Z + \ D I k ) 1 e m < \^e T m e m = A^, 1 



2d/i + fc - 2 fc 



(19) 



(20) 



Putting ([l8 |) -(|20 ]) together gives 



E,-, [E 4 ( U2 (0|A)U']<E J7 



= E,- 



]T[m«™i a )] 5 

rn— 1 
k 

E^ 



+ E e j ( [abM^I 

777,-1 

+ fe (2d I2 +«2(0) 



= S l2 v 2 (£') + L 3 . 



□ 
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We are now ready to finish the proof of Proposition 13.21 We consider the case with nonsingular 
Z T Z and the case in which no restrictions are placed on Z separately. 

1. Case 1: Z T Z nonsingular 

Notice that L\ + L3 < L for L\ and L3 given by (fT3")) and (jTT)) . respectively. Then by Lemmas 
IA.3l and IA.5l we have that for any i, j, I 

E jt PWOIA) = E j7 [EMO + «a(0|A) \C] 

< SjivxiO + 5i2v 2 {0 + L x + L 3 (21) 

< 7 v(e) + l . 

Then combining (fT2")l and (|21[) establishes the drift condition. 

2. Case 2: Z T Z is possibly singular 

Observe that L 2 + L3 < L for L 2 and L s given by (fT4|) and (fT7|) . respectively. Further, it follows 
from Lemmas IA.4I and IA.5I that for any I 

E jt mV(m)W] = EjJ [E,(t;i(0 +«a(OW |C] 

< <*W£') + (5/4 - fe)«2(C) + L 2 + Si 2 v 2 (C) + L 3 
= S33V1 (£') + foua(0 + i 2 + L 3 

< 7 y(C) + l . 

Hence the result holds by combining (|12l) and (|2"2"1) . 
A. 2 Proof of Proposition 13.31 

By the assumption that hi = for all i, £|A, 2/ ~ Nfc +J3 (mo, S _1 ) where 



(22) 



sr 1 



m 



{XrZ t Z + 1 

(AiO^AT + B) -1 

Afl^A^Z + Ap/fc)" 1 ^ 
A« (A^X + B)" 1 ^ 



Define A 3 := Ai?X T X + B and A ft := X R Z T Z + X D I k . Then E(/3|A) = X R A~ 1 X T y and E(u|A) = 
\ R A- h 1 Z T y. 

We must establish that there exists K for which 

N k 

[E (y m — £ m /3 — z m w| A, y)] 2 + ^2 [E ( \X,y)] z <K. 
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Let 

/(A) = (y - XE(/?|A) - ZE(u\X)) T (y - XE((3\X) - ZE(u\X)) + E( U |A) T E( U |A) 

and note that the claim will be proven if we can show that /(A) < K for all A. To this end, define 
functions g, and h as 

g(X) = (y - XE(f3\X)) T (y ~ XE(/3\\)) 

h(X) = E(m|A) t Z t ZE(m|A) + E(m|A) t E(m|A) - 2y T ZE(u\X) . 

Since the conditional independence of f3 and u given A implies X T Z = 0, a little algebra shows that 
/(A) = g(X) + h(X). Thus, it suffices to find K g and K h such that for all A, g(X) < K 2 and h(X) < K 2 . 
First, 

.g(A) = y T y + E(/3|A) T A: T A:E(/?|A) - 2y T XE(P\X) 

= V T y + X 2 B fXA~ 1 X T XA- 1 X T y - 2X R y T XA- 1 X T y 
= y T y - X R y T XA- 1 BA- g 1 X T y + X R y T X A^ 1 AgA^ 1 X T y 

- 2X R y T XA- 1 X T y 
= y T y - X R y T XA- 1 BA- 1 X T y - X R y T XA; 1 X T y 

< y T y 

■=K 2 

by the positive definiteness of B and A~ 1 . Notice that this also proves part 1 of the claim. Next, we 
have 

h(X) = X 2 R y T ZA- l Z T ZA- 1 Z T y + X 2 R y T Z A~ 2 Z T y - 2X R y T ZA- 1 Z T y 
= X R y T ZAl 1 A h Al 1 Z T y - X R X D y T Z Aj 2 Z T y + X 2 R y T ZAj 2 Z T y 

-2X R y T ZA- h 1 Z T y 
= (X 2 R - X R X D )y T ZA- 2 Z T y - X R y T ZA^ 1 Z T y. 
Since A^ 1 and A^ 2 are positive semidcfinitc we have 
h(X) < X\y T ZAj 2 Z T y 

= X 2 R y T Z [[X R Z T Z) 2 + X D (2X R Z T Z + X D I k )^j ' Z T y 

< X 2 R y T Z(X R Z T Z)- 2 Z T y 
= y T Z{Z T Z)- 2 Z T y 

— K 2 

where the last inequality holds by Lemma IA.1I The result now follows by setting K 2 = K 2 + K 2 . 
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A. 3 Lemma [A. 61 

Lemma A. 6. The fourth posterior moments o//?o, Pi, and (32 are each finite. 

Proof. We present the proof for fa. The proofs for /3q and /3i are similar. The finiteness of E [/3| | y] 
will follow from establishing that E [/3| | A^, y] is finite since 

E[/3 2 4 |2/] =E[E(l3*\\ R ,y)\y] . 

To this end, recall that 

(3\\ R , y~N ((\ R X T X + B)' 1 (\ R X T y + Bb) , (X R X T X + B) ' 
Also, let fi2 = E(/?2|Ar, and e3 denote a vector of zeroes with a one in the third position. Then 



E 



= 3 [(A fl X T X + B) 33 1 
= 3 [ef (\ R X T X + B)' 1 e 3 

= 3Bn-? 



where the inequality follows from Lemma IA.1I It follows that the fourth (non-central) moment 
E [#f | X R ,y] is finite. □ 
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